• PROJECT OBJECTIVE: To understand K-means Clustering by applying on the Car Dataset to segment the cars into various categories.

In [146]:
import numpy as np
import pandas as pd
import json
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

1. Data Understanding & Exploration:

1.A. Read ‘Car name.csv’ as a DataFrame and assign it to a variable

In [147]:
d1= pd.read_csv("C:\\Users\\HARITHA\\Car name.csv")
In [148]:
d1.head()
Out[148]:
car_name
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino

1.B. Read ‘Car-Attributes.json as a DataFrame and assign it to a variable. [

In [149]:
d2 = pd.read_json("Car-Attributes.json")
In [150]:
d2.head()
Out[150]:
mpg cyl disp hp wt acc yr origin
0 18.0 8 307.0 130 3504 12.0 70 1
1 15.0 8 350.0 165 3693 11.5 70 1
2 18.0 8 318.0 150 3436 11.0 70 1
3 16.0 8 304.0 150 3433 12.0 70 1
4 17.0 8 302.0 140 3449 10.5 70 1

1.C. Merge both the DataFrames together to form a single DataFrame

In [151]:
df=d1.join(d2)
In [152]:
df.head()
Out[152]:
car_name mpg cyl disp hp wt acc yr origin
0 chevrolet chevelle malibu 18.0 8 307.0 130 3504 12.0 70 1
1 buick skylark 320 15.0 8 350.0 165 3693 11.5 70 1
2 plymouth satellite 18.0 8 318.0 150 3436 11.0 70 1
3 amc rebel sst 16.0 8 304.0 150 3433 12.0 70 1
4 ford torino 17.0 8 302.0 140 3449 10.5 70 1
In [153]:
df.shape
Out[153]:
(398, 9)

1.D. Print 5 point summary of the numerical features and share insights.

In [154]:
df.describe().T
Out[154]:
count mean std min 25% 50% 75% max
mpg 398.0 23.514573 7.815984 9.0 17.500 23.0 29.000 46.6
cyl 398.0 5.454774 1.701004 3.0 4.000 4.0 8.000 8.0
disp 398.0 193.425879 104.269838 68.0 104.250 148.5 262.000 455.0
wt 398.0 2970.424623 846.841774 1613.0 2223.750 2803.5 3608.000 5140.0
acc 398.0 15.568090 2.757689 8.0 13.825 15.5 17.175 24.8
yr 398.0 76.010050 3.697627 70.0 73.000 76.0 79.000 82.0
origin 398.0 1.572864 0.802055 1.0 1.000 1.0 2.000 3.0
  • insights:
In [155]:
sns.boxplot(df,orient='h');

2. Data Preparation & Analysis:

2.A. Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.

In [156]:
percent_missing = df.isnull().sum() * 100 / len(df)
In [157]:
percent_missing 
Out[157]:
car_name    0.0
mpg         0.0
cyl         0.0
disp        0.0
hp          0.0
wt          0.0
acc         0.0
yr          0.0
origin      0.0
dtype: float64
In [158]:
 df.isnull().sum()
Out[158]:
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64

2.B. Check for duplicate values in the data and impute with the best suitable approach

In [159]:
df.duplicated().sum()
Out[159]:
0

2.C. Plot a pairplot for all features.

In [160]:
sns.pairplot(df,diag_kind='kde');

2.D. Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.

In [161]:
sns.scatterplot(data=df,x='wt',y='disp',hue='cyl',palette='dark');

2.E. Share insights for Q2.d

  • if the wt increases the disp increases.

2.F. Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’

In [162]:
sns.scatterplot(data=df,x='wt',y='mpg',hue='cyl',palette='bright');

2.G. Share insights for Q2.f

  • when the wt increases mpg gets decreased, if the wt is low the mpg will be high

2.H. Check for unexpected values in all the features and datapoints with such values.

In [163]:
df['hp'].sample(25)
Out[163]:
141     83
139    140
283     90
255     88
146     75
233     78
55      60
363    110
108     88
371     84
163     95
23     113
67     208
355     75
0      130
160    110
241     97
48      88
218     58
367     88
351     65
185     79
162    110
213    145
83      80
Name: hp, dtype: object
In [164]:
df[df['hp']=="?"]
Out[164]:
car_name mpg cyl disp hp wt acc yr origin
32 ford pinto 25.0 4 98.0 ? 2046 19.0 71 1
126 ford maverick 21.0 6 200.0 ? 2875 17.0 74 1
330 renault lecar deluxe 40.9 4 85.0 ? 1835 17.3 80 2
336 ford mustang cobra 23.6 4 140.0 ? 2905 14.3 80 1
354 renault 18i 34.5 4 100.0 ? 2320 15.8 81 2
374 amc concord dl 23.0 4 151.0 ? 3035 20.5 82 1
In [165]:
df['hp'].replace("?",np.nan, inplace=True)
In [166]:
df['hp'].iloc[336]
Out[166]:
nan
In [167]:
df['hp'].dropna(inplace=True)
In [168]:
df['hp'].iloc[336]
Out[168]:
92.0
In [169]:
df.shape
Out[169]:
(398, 9)
In [170]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   car_name  398 non-null    object 
 1   mpg       398 non-null    float64
 2   cyl       398 non-null    int64  
 3   disp      398 non-null    float64
 4   hp        392 non-null    float64
 5   wt        398 non-null    int64  
 6   acc       398 non-null    float64
 7   yr        398 non-null    int64  
 8   origin    398 non-null    int64  
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
In [171]:
df['hp'].fillna((df['hp'].median()), inplace=True)
df['hp'] = df['hp'].astype('float')
In [172]:
df.dtypes
Out[172]:
car_name     object
mpg         float64
cyl           int64
disp        float64
hp          float64
wt            int64
acc         float64
yr            int64
origin        int64
dtype: object
In [173]:
# importing the StandardScaler Module
from sklearn.preprocessing import StandardScaler
In [174]:
# Creating an object for the StandardScaler function
X = StandardScaler()
In [175]:
scaled_df = X.fit_transform(df.iloc[:,1:9])
In [176]:
scaled_df
Out[176]:
array([[-0.7064387 ,  1.49819126,  1.0906037 , ..., -1.29549834,
        -1.62742629, -0.71514478],
       [-1.09075062,  1.49819126,  1.5035143 , ..., -1.47703779,
        -1.62742629, -0.71514478],
       [-0.7064387 ,  1.49819126,  1.19623199, ..., -1.65857724,
        -1.62742629, -0.71514478],
       ...,
       [ 1.08701694, -0.85632057, -0.56103873, ..., -1.4407299 ,
         1.62198339, -0.71514478],
       [ 0.57460104, -0.85632057, -0.70507731, ...,  1.10082237,
         1.62198339, -0.71514478],
       [ 0.95891297, -0.85632057, -0.71467988, ...,  1.39128549,
         1.62198339, -0.71514478]])

3. Clustering:

3.A. Apply K-Means clustering for 2 to 10 clusters.

In [177]:
df.isnull().sum()
Out[177]:
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64
In [178]:
df.dropna(inplace=True)
In [179]:
df.isnull().sum()
Out[179]:
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64
In [180]:
df.groupby(df['cyl']).mean()
Out[180]:
mpg disp hp wt acc yr origin
cyl
3 20.550000 72.500000 99.250000 2398.500000 13.250000 75.500000 3.000000
4 29.286765 109.796569 78.654412 2308.127451 16.601471 77.073529 1.985294
5 27.366667 145.000000 82.333333 3103.333333 18.633333 79.000000 2.000000
6 19.985714 218.142857 101.410714 3198.226190 16.263095 75.928571 1.190476
8 14.963107 345.009709 158.300971 4114.718447 12.955340 73.902913 1.000000
In [181]:
# Calculate age of vehicle
df['age'] = 83-df['yr']
df.head()
Out[181]:
car_name mpg cyl disp hp wt acc yr origin age
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 70 1 13
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 70 1 13
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 70 1 13
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 70 1 13
4 ford torino 17.0 8 302.0 140.0 3449 10.5 70 1 13
In [182]:
#Convert origing into dummy variables (This again is subjected to business knowledge. We might drop this variable as well
# Inclusion is more to demonstrate on how to use categorical data)

one_hot = pd.get_dummies(df['origin'])
one_hot = one_hot.add_prefix('origin_')

# merge in main data frame
df = df.join(one_hot)
df.head()
Out[182]:
car_name mpg cyl disp hp wt acc yr origin age origin_1 origin_2 origin_3
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 70 1 13 1 0 0
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 70 1 13 1 0 0
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 70 1 13 1 0 0
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 70 1 13 1 0 0
4 ford torino 17.0 8 302.0 140.0 3449 10.5 70 1 13 1 0 0
In [183]:
df_new = df.drop(['yr','origin','car_name'], axis =1)
df_new.head()
Out[183]:
mpg cyl disp hp wt acc age origin_1 origin_2 origin_3
0 18.0 8 307.0 130.0 3504 12.0 13 1 0 0
1 15.0 8 350.0 165.0 3693 11.5 13 1 0 0
2 18.0 8 318.0 150.0 3436 11.0 13 1 0 0
3 16.0 8 304.0 150.0 3433 12.0 13 1 0 0
4 17.0 8 302.0 140.0 3449 10.5 13 1 0 0
In [184]:
sns.boxplot(data=df_new);
In [185]:
# We could see some outliers for mpg,hp and acc
sns.boxplot(y=df_new['mpg']);
In [186]:
sns.boxplot(y=df_new['hp']);
In [187]:
df_new['hp'] = np.log(df_new['hp'])
df_new['acc'] = np.log(df_new['acc'])
df_new['mpg'] = np.log(df_new['mpg'])
df_new.head()
Out[187]:
mpg cyl disp hp wt acc age origin_1 origin_2 origin_3
0 2.890372 8 307.0 4.867534 3504 2.484907 13 1 0 0
1 2.708050 8 350.0 5.105945 3693 2.442347 13 1 0 0
2 2.890372 8 318.0 5.010635 3436 2.397895 13 1 0 0
3 2.772589 8 304.0 5.010635 3433 2.484907 13 1 0 0
4 2.833213 8 302.0 4.941642 3449 2.351375 13 1 0 0
In [188]:
sns.boxplot(data=df_new);
In [189]:
from scipy.stats import zscore
df_new.dtypes
numeric_cols = df_new.select_dtypes(include=[np.int64, np.float64]).columns
numeric_cols
df_new[numeric_cols] =df_new[numeric_cols].apply(zscore)
In [190]:
df_new.head()
Out[190]:
mpg cyl disp hp wt acc age origin_1 origin_2 origin_3
0 -0.622035 1.498191 1.090604 0.823608 0.630870 -1.353748 1.627426 1 0 0
1 -1.159493 1.498191 1.503514 1.523992 0.854333 -1.589535 1.627426 1 0 0
2 -0.622035 1.498191 1.196232 1.243998 0.550470 -1.835805 1.627426 1 0 0
3 -0.969242 1.498191 1.061796 1.243998 0.546923 -1.353748 1.627426 1 0 0
4 -0.790530 1.498191 1.042591 1.041316 0.565841 -2.093533 1.627426 1 0 0
In [191]:
cluster_range = range(2,11)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters, n_init = 5)
    clusters.fit(df_new)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)

clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
clusters_df
Out[191]:
num_clusters cluster_errors
0 2 1435.661694
1 3 1072.434353
2 4 873.977856
3 5 786.882347
4 6 719.246694
5 7 669.071352
6 8 626.382044
7 9 592.177064
8 10 562.011930

3.B. Plot a visual and find elbow point

In [192]:
from matplotlib import cm

plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[192]:
[<matplotlib.lines.Line2D at 0x1b650db9d08>]

3.C. On the above visual, highlight which are the possible Elbow points

  • Based on the above visual the possible elbow points would be 3,4,5

3.D. Train a K-means clustering model once again on the optimal number of clusters

In [193]:
#Set the value of k=4
kmeans = KMeans(n_clusters=4, n_init = 15, random_state=2345)
In [194]:
kmeans.fit(df_new)
Out[194]:
KMeans(n_clusters=4, n_init=15, random_state=2345)
In [195]:
centroids = kmeans.cluster_centers_
In [196]:
centroids
Out[196]:
array([[ 3.37373388e-01, -8.68583653e-01, -8.24059183e-01,
        -5.45072859e-01, -7.70282170e-01,  3.08194952e-01,
         6.57116455e-01,  2.50000000e-01,  4.68750000e-01,
         2.81250000e-01],
       [-1.30629972e+00,  1.49819126e+00,  1.50392292e+00,
         1.44265580e+00,  1.40409797e+00, -1.15022628e+00,
         6.88323847e-01,  1.00000000e+00, -2.49800181e-16,
         8.32667268e-17],
       [ 1.08052956e+00, -8.21103514e-01, -7.73033976e-01,
        -7.99132185e-01, -7.48370271e-01,  4.26220045e-01,
        -1.08735830e+00,  4.10256410e-01,  1.88034188e-01,
         4.01709402e-01],
       [-3.95798123e-01,  4.24430369e-01,  3.09735186e-01,
         1.12261708e-01,  3.24408797e-01,  3.15019884e-01,
        -6.20550104e-03,  9.12087912e-01,  3.29670330e-02,
         5.49450549e-02]])
In [197]:
#Calculate the centroids for the columns to profile
centroid_df = pd.DataFrame(centroids, columns = list(df_new) )
In [198]:
print(centroid_df)
        mpg       cyl      disp        hp        wt       acc       age  \
0  0.337373 -0.868584 -0.824059 -0.545073 -0.770282  0.308195  0.657116   
1 -1.306300  1.498191  1.503923  1.442656  1.404098 -1.150226  0.688324   
2  1.080530 -0.821104 -0.773034 -0.799132 -0.748370  0.426220 -1.087358   
3 -0.395798  0.424430  0.309735  0.112262  0.324409  0.315020 -0.006206   

   origin_1      origin_2      origin_3  
0  0.250000  4.687500e-01  2.812500e-01  
1  1.000000 -2.498002e-16  8.326673e-17  
2  0.410256  1.880342e-01  4.017094e-01  
3  0.912088  3.296703e-02  5.494505e-02  
In [199]:
sns.scatterplot(centroid_df,legend='full')
Out[199]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b644a90ec8>

3.E. Add a new feature in the DataFrame which will have labels based upon cluster value.

In [200]:
## creating a new dataframe only for labels and converting it into categorical variable
df_labels = pd.DataFrame(kmeans.labels_ , columns = list(['labels']))

df_labels['labels'] = df_labels['labels'].astype('category')
In [201]:
# Joining the label dataframe with the data frame.
df_labeled = df.join(df_labels)
In [202]:
df_analysis = (df_labeled.groupby(['labels'] , axis=0)).head(4177)  # the groupby creates a groupeddataframe that needs 
# to be converted back to dataframe. 
df_analysis
Out[202]:
car_name mpg cyl disp hp wt acc yr origin age origin_1 origin_2 origin_3 labels
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 70 1 13 1 0 0 1
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 70 1 13 1 0 0 1
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 70 1 13 1 0 0 1
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 70 1 13 1 0 0 1
4 ford torino 17.0 8 302.0 140.0 3449 10.5 70 1 13 1 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
393 ford mustang gl 27.0 4 140.0 86.0 2790 15.6 82 1 1 1 0 0 2
394 vw pickup 44.0 4 97.0 52.0 2130 24.6 82 2 1 0 1 0 2
395 dodge rampage 32.0 4 135.0 84.0 2295 11.6 82 1 1 1 0 0 2
396 ford ranger 28.0 4 120.0 79.0 2625 18.6 82 1 1 1 0 0 2
397 chevy s-10 31.0 4 119.0 82.0 2720 19.4 82 1 1 1 0 0 2

398 rows × 14 columns

In [203]:
df_labeled['labels'].value_counts()
Out[203]:
2    117
0     96
1     94
3     91
Name: labels, dtype: int64

3.F. Plot a visual and color the datapoints based upon clusters

In [204]:
from mpl_toolkits.mplot3d import Axes3D
In [205]:
## 3D plots of clusters
fig = plt.figure(figsize=(8, 6))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=20, azim=60)
kmeans.fit(df_new)
labels = kmeans.labels_
ax.scatter(df_new.iloc[:, 0], df_new.iloc[:, 1],df_new.iloc[:,9],c=labels.astype(np.float), edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Length')
ax.set_ylabel('Height')
ax.set_zlabel('Weight')
ax.set_title('3D plot of KMeans Clustering')
Out[205]:
Text(0.5, 0.92, '3D plot of KMeans Clustering')

3.G.Pass a new DataPoint and predict which cluster it belongs to

In [206]:
df_row={'car_name': 'ford torino',
         'mpg': '17.0',
          'cyl': '80',
          'disp': '303.0',
          'hp': '145.0',
          'wt': '3431',
          'acc': '11.5',
          'yr': '80',
          'origin':'1',
          'age': '13',
          'origin_1': '1',
          'origin_2': '0',
          'origin_3':'0'}
In [207]:
df= df.append(df_row,ignore_index=True )
In [208]:
df.shape
Out[208]:
(399, 13)
In [297]:
sns.scatterplot(data=df_new,legend='auto');

PART B :

1. Data Understanding & Cleaning:

1.A. Read ‘vehicle.csv’ and save as DataFrame.

In [210]:
#For numerical libraries
import numpy as np
#To handle data in the form of rows and columns
import pandas as pd
#importing seaborn for statistical plots
import seaborn as sns
#importing ploting libraries
import matplotlib.pyplot as plt
#styling figures
plt.rc('font',size=14)
sns.set(style='white')
sns.set(style='whitegrid',color_codes=True)
#To enable plotting graphs in Jupyter notebook
%matplotlib inline
#importing the Encoding library
from sklearn.preprocessing import LabelEncoder
#Build the model with the best hyper parameters
from sklearn.model_selection import cross_val_score
#importing the zscore for scaling
from scipy.stats import zscore
#Importing PCA for dimensionality reduction and visualization
from sklearn.decomposition import PCA
# Import Logistic Regression machine learning library
from sklearn.linear_model import LogisticRegression 
# Import Support Vector Classifier machine learning library
from sklearn.svm import SVC
#Import Naive Bayes' machine learning Library
from sklearn.naive_bayes import GaussianNB
#Import Sklearn package's data splitting function which is based on random function
from sklearn.model_selection import train_test_split
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
# Import the metrics
from sklearn import metrics
In [211]:
vehicle_df=pd.read_csv('C:\\Users\\HARITHA\\vehicle.csv')
In [212]:
vehicle_df.head(10)
Out[212]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
In [213]:
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB

1.B. Check percentage of missing values and impute with correct approach

In [214]:
vehicle_df.isnull().sum()
Out[214]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [215]:
percent_missing = vehicle_df.isnull().sum() * 100 / len(df)
In [216]:
percent_missing
Out[216]:
compactness                    0.000000
circularity                    1.253133
distance_circularity           1.002506
radius_ratio                   1.503759
pr.axis_aspect_ratio           0.501253
max.length_aspect_ratio        0.000000
scatter_ratio                  0.250627
elongatedness                  0.250627
pr.axis_rectangularity         0.751880
max.length_rectangularity      0.000000
scaled_variance                0.751880
scaled_variance.1              0.501253
scaled_radius_of_gyration      0.501253
scaled_radius_of_gyration.1    1.002506
skewness_about                 1.503759
skewness_about.1               0.250627
skewness_about.2               0.250627
hollows_ratio                  0.000000
class                          0.000000
dtype: float64
In [217]:
#class attribute is not an object it is a category
vehicle_df['class']=vehicle_df['class'].astype('category')
In [218]:
#To get the shape 
vehicle_df.shape
Out[218]:
(846, 19)
In [219]:
#To get the number of columns
vehicle_df.columns
Out[219]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
In [220]:
#Checking for missing values in the dataset
vehicle_df.isnull().sum()
Out[220]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [221]:
#replace missing variable('?') into null variable using numpy
vehicle_df = vehicle_df.replace(' ', np.nan)
In [222]:
#Replacing the missing values by median 
for i in vehicle_df.columns[:17]:
    median_value = vehicle_df[i].median()
    vehicle_df[i] = vehicle_df[i].fillna(median_value)
In [223]:
# again check for missing values
vehicle_df.isnull().sum()
Out[223]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [224]:
# Again check data information
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   compactness                  846 non-null    int64   
 1   circularity                  846 non-null    float64 
 2   distance_circularity         846 non-null    float64 
 3   radius_ratio                 846 non-null    float64 
 4   pr.axis_aspect_ratio         846 non-null    float64 
 5   max.length_aspect_ratio      846 non-null    int64   
 6   scatter_ratio                846 non-null    float64 
 7   elongatedness                846 non-null    float64 
 8   pr.axis_rectangularity       846 non-null    float64 
 9   max.length_rectangularity    846 non-null    int64   
 10  scaled_variance              846 non-null    float64 
 11  scaled_variance.1            846 non-null    float64 
 12  scaled_radius_of_gyration    846 non-null    float64 
 13  scaled_radius_of_gyration.1  846 non-null    float64 
 14  skewness_about               846 non-null    float64 
 15  skewness_about.1             846 non-null    float64 
 16  skewness_about.2             846 non-null    float64 
 17  hollows_ratio                846 non-null    int64   
 18  class                        846 non-null    category
dtypes: category(1), float64(14), int64(4)
memory usage: 120.0 KB
In [225]:
# Understand the spread and outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15));
In [226]:
# Histogram 
vehicle_df.hist(figsize=(15,15));
In [227]:
#find the outliers and replace them by median
for col_name in vehicle_df.columns[:-1]:
    q1 = vehicle_df[col_name].quantile(0.25)
    q3 = vehicle_df[col_name].quantile(0.75)
    iqr = q3 - q1
    
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    
    vehicle_df.loc[(vehicle_df[col_name] < low) | (vehicle_df[col_name] > high), col_name] = vehicle_df[col_name].median()
In [228]:
# again check for outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15));
In [229]:
print('Class: \n', vehicle_df['class'].unique())
Class: 
 [van, car, bus]
Categories (3, object): [van, car, bus]
In [230]:
vehicle_df['class'].value_counts()
Out[230]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [231]:
#Encoding of categorical variables
labelencoder_X=LabelEncoder()
vehicle_df['class']=labelencoder_X.fit_transform(vehicle_df['class'])
In [232]:
#correlation matrix
cor=vehicle_df.corr()
cor
Out[232]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
compactness 1.000000 0.684887 0.789928 0.721925 0.192864 0.499928 0.812620 -0.788750 0.813694 0.676143 0.769871 0.806170 0.585243 -0.246681 0.197308 0.156348 0.298537 0.365552 -0.033796
circularity 0.684887 1.000000 0.792320 0.638280 0.203253 0.560470 0.847938 -0.821472 0.843400 0.961318 0.802768 0.827462 0.925816 0.068745 0.136351 -0.009666 -0.104426 0.046351 -0.158910
distance_circularity 0.789928 0.792320 1.000000 0.794222 0.244332 0.666809 0.905076 -0.911307 0.893025 0.774527 0.869584 0.883943 0.705771 -0.229353 0.099107 0.262345 0.146098 0.332732 -0.064467
radius_ratio 0.721925 0.638280 0.794222 1.000000 0.650554 0.463958 0.769941 -0.825392 0.744139 0.579468 0.786183 0.760257 0.550774 -0.390459 0.035755 0.179601 0.405849 0.491758 -0.213948
pr.axis_aspect_ratio 0.192864 0.203253 0.244332 0.650554 1.000000 0.150295 0.194195 -0.298144 0.163047 0.147592 0.207101 0.196401 0.148591 -0.321070 -0.056030 -0.021088 0.400882 0.415734 -0.209298
max.length_aspect_ratio 0.499928 0.560470 0.666809 0.463958 0.150295 1.000000 0.490759 -0.504181 0.487931 0.642713 0.401391 0.463249 0.397397 -0.335444 0.081898 0.141664 0.083794 0.413174 0.352958
scatter_ratio 0.812620 0.847938 0.905076 0.769941 0.194195 0.490759 1.000000 -0.971601 0.989751 0.809083 0.960883 0.980447 0.799875 0.011314 0.064242 0.211647 0.005628 0.118817 -0.288895
elongatedness -0.788750 -0.821472 -0.911307 -0.825392 -0.298144 -0.504181 -0.971601 1.000000 -0.948996 -0.775854 -0.947644 -0.948851 -0.766314 0.078391 -0.046943 -0.183642 -0.115126 -0.216905 0.339344
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.744139 0.163047 0.487931 0.989751 -0.948996 1.000000 0.810934 0.947329 0.973606 0.796690 0.027545 0.073127 0.213801 -0.018649 0.099286 -0.258481
max.length_rectangularity 0.676143 0.961318 0.774527 0.579468 0.147592 0.642713 0.809083 -0.775854 0.810934 1.000000 0.750222 0.789632 0.866450 0.053856 0.130702 0.004129 -0.103948 0.076770 -0.032399
scaled_variance 0.769871 0.802768 0.869584 0.786183 0.207101 0.401391 0.960883 -0.947644 0.947329 0.750222 1.000000 0.943780 0.785073 0.025828 0.024693 0.197122 0.015171 0.086330 -0.324062
scaled_variance.1 0.806170 0.827462 0.883943 0.760257 0.196401 0.463249 0.980447 -0.948851 0.973606 0.789632 0.943780 1.000000 0.782972 0.009386 0.065731 0.204941 0.017557 0.119642 -0.279487
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.550774 0.148591 0.397397 0.799875 -0.766314 0.796690 0.866450 0.785073 0.782972 1.000000 0.215279 0.162970 -0.055667 -0.224450 -0.118002 -0.250267
scaled_radius_of_gyration.1 -0.246681 0.068745 -0.229353 -0.390459 -0.321070 -0.335444 0.011314 0.078391 0.027545 0.053856 0.025828 0.009386 0.215279 1.000000 -0.057755 -0.123996 -0.832738 -0.901332 -0.283540
skewness_about 0.197308 0.136351 0.099107 0.035755 -0.056030 0.081898 0.064242 -0.046943 0.073127 0.130702 0.024693 0.065731 0.162970 -0.057755 1.000000 -0.041734 0.086661 0.062619 0.126720
skewness_about.1 0.156348 -0.009666 0.262345 0.179601 -0.021088 0.141664 0.211647 -0.183642 0.213801 0.004129 0.197122 0.204941 -0.055667 -0.123996 -0.041734 1.000000 0.074473 0.200651 -0.010872
skewness_about.2 0.298537 -0.104426 0.146098 0.405849 0.400882 0.083794 0.005628 -0.115126 -0.018649 -0.103948 0.015171 0.017557 -0.224450 -0.832738 0.086661 0.074473 1.000000 0.892581 0.067244
hollows_ratio 0.365552 0.046351 0.332732 0.491758 0.415734 0.413174 0.118817 -0.216905 0.099286 0.076770 0.086330 0.119642 -0.118002 -0.901332 0.062619 0.200651 0.892581 1.000000 0.235874
class -0.033796 -0.158910 -0.064467 -0.213948 -0.209298 0.352958 -0.288895 0.339344 -0.258481 -0.032399 -0.324062 -0.279487 -0.250267 -0.283540 0.126720 -0.010872 0.067244 0.235874 1.000000
In [233]:
# correlation plot---heatmap
sns.set(font_scale=1.15)
fig,ax=plt.subplots(figsize=(18,15))
sns.heatmap(cor,vmin=0.8, annot=True,linewidths=0.01,center=0,linecolor="white",cbar=False,square=True)
plt.title('Correlation between attributes',fontsize=18)
ax.tick_params(labelsize=18)
In [234]:
#pair panel
sns.pairplot(vehicle_df,hue='class');

1.C. Visualize a Pie-chart and print percentage of values for variable ‘class’

In [235]:
vehicle_df['class'].value_counts()
Out[235]:
1    429
0    218
2    199
Name: class, dtype: int64
In [236]:
per_class=vehicle_df['class'].value_counts() * 100 / len(vehicle_df['class'])
In [237]:
per_class
Out[237]:
1    50.709220
0    25.768322
2    23.522459
Name: class, dtype: float64
In [238]:
plt.pie(per_class, autopct='%1.1f%%');

1.D. Check for duplicate rows in the data and impute with correct approach.

In [239]:
vehicle_df.duplicated().sum()
Out[239]:
0

2. Data Preparation:

2.A. Split data into X and Y. [Train and Test optional]

In [240]:
#independent and dependent variables
X=vehicle_df.iloc[:,0:18]
y = vehicle_df.iloc[:,18]
In [241]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 10)
In [242]:
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
C:\Users\HARITHA\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
In [243]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ',model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.9003378378378378
Accuracy on Testing data:  0.8858267716535433
Recall value:  0.881797636393071
Precision value:  0.8811525423728813
Confusion Matrix:
 [[ 63   5   2]
 [  7 112   6]
 [  1   8  50]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.90      0.89        70
           1       0.90      0.90      0.90       125
           2       0.86      0.85      0.85        59

    accuracy                           0.89       254
   macro avg       0.88      0.88      0.88       254
weighted avg       0.89      0.89      0.89       254

  • The accuracy, recall, and precision values using Logistic Regression model is high without applying dimentionality reduction (PCA).
In [244]:
resultsDf=pd.DataFrame({'Model':['Logistic'],'Accuracy': model.score(X_test , y_test)},index={'1'})
resultsDf=resultsDf[['Model','Accuracy']]
resultsDf
Out[244]:
Model Accuracy
1 Logistic 0.885827
In [245]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
In [246]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.6266891891891891
Accuracy on Testing data:  0.594488188976378
Recall value:  0.608675408774486
Precision value:  0.7405060217560218
Confusion Matrix:
 [[15  0  0]
 [16 79  1]
 [40 46 57]]
Classification Report:
               precision    recall  f1-score   support

           0       0.21      1.00      0.35        15
           1       0.63      0.82      0.71        96
           2       0.98      0.40      0.57       143

    accuracy                           0.59       254
   macro avg       0.61      0.74      0.54       254
weighted avg       0.80      0.59      0.61       254

In [247]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['Naive Bayes'], 'Accuracy': model.score(X_test, y_test)},index={'2'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[247]:
Model Accuracy
1 Logistic 0.885827
2 Naive Bayes 0.594488
In [248]:
clf = SVC()
clf.fit(X_train, y_train)
prediction = model.predict(X_test)

3.B. Print Classification metrics for train data.

In [249]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.6266891891891891
Accuracy on Testing data:  0.594488188976378
Recall value:  0.608675408774486
Precision value:  0.7405060217560218
Confusion Matrix:
 [[15  0  0]
 [16 79  1]
 [40 46 57]]
Classification Report:
               precision    recall  f1-score   support

           0       0.21      1.00      0.35        15
           1       0.63      0.82      0.71        96
           2       0.98      0.40      0.57       143

    accuracy                           0.59       254
   macro avg       0.61      0.74      0.54       254
weighted avg       0.80      0.59      0.61       254

In [250]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)},index={'3'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[250]:
Model Accuracy
1 Logistic 0.885827
2 Naive Bayes 0.594488
3 SVM 0.594488

3.C. Apply PCA on the data with 10 components.

In [251]:
# Scaling the independent attributes using zscore
X_z=X.apply(zscore)
In [252]:
# prior to scaling
plt.rcParams['figure.figsize']=(10,6)
plt.plot(vehicle_df)
plt.show()
In [253]:
#plt.plot(X_z,figsize=(20,10))
plt.rcParams['figure.figsize']=(10,6)
plt.plot(X_z)
plt.show()
In [254]:
# Calculating the covariance between attributes after scaling
cov_matrix = np.cov(X_z.T)
print('Covariance Matrix \n%s', cov_matrix)
Covariance Matrix 
%s [[ 1.00118343  0.68569786  0.79086299  0.72277977  0.1930925   0.50051942
   0.81358214 -0.78968322  0.81465658  0.67694334  0.77078163  0.80712401
   0.58593517 -0.24697246  0.19754181  0.1565327   0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.63903532  0.20349327  0.5611334
   0.8489411  -0.82244387  0.84439802  0.96245572  0.80371846  0.82844154
   0.92691166  0.06882659  0.13651201 -0.00967793 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.79516215  0.24462154  0.66759792
   0.90614687 -0.9123854   0.89408198  0.77544391  0.87061349  0.88498924
   0.70660663 -0.22962442  0.09922417  0.26265581  0.14627113  0.33312625]
 [ 0.72277977  0.63903532  0.79516215  1.00118343  0.65132393  0.46450748
   0.77085211 -0.82636872  0.74502008  0.58015378  0.78711387  0.76115704
   0.55142559 -0.39092105  0.03579728  0.17981316  0.40632957  0.49234013]
 [ 0.1930925   0.20349327  0.24462154  0.65132393  1.00118343  0.15047265
   0.19442484 -0.29849719  0.16323988  0.14776643  0.20734569  0.19663295
   0.14876723 -0.32144977 -0.05609621 -0.02111342  0.401356    0.41622574]
 [ 0.50051942  0.5611334   0.66759792  0.46450748  0.15047265  1.00118343
   0.49133933 -0.50477756  0.48850876  0.64347365  0.40186618  0.46379685
   0.39786723 -0.33584133  0.08199536  0.14183116  0.08389276  0.41366325]
 [ 0.81358214  0.8489411   0.90614687  0.77085211  0.19442484  0.49133933
   1.00118343 -0.97275069  0.99092181  0.81004084  0.96201996  0.98160681
   0.80082111  0.01132718  0.06431825  0.21189733  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.82636872 -0.29849719 -0.50477756
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.94876596 -0.94997386
  -0.76722075  0.07848365 -0.04699819 -0.18385891 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.74502008  0.16323988  0.48850876
   0.99092181 -0.95011894  1.00118343  0.81189327  0.94845027  0.97475823
   0.79763248  0.02757736  0.07321311  0.21405404 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.58015378  0.14776643  0.64347365
   0.81004084 -0.77677186  0.81189327  1.00118343  0.75110957  0.79056684
   0.86747579  0.05391989  0.13085669  0.00413356 -0.10407076  0.07686047]
 [ 0.77078163  0.80371846  0.87061349  0.78711387  0.20734569  0.40186618
   0.96201996 -0.94876596  0.94845027  0.75110957  1.00118343  0.94489677
   0.78600191  0.02585841  0.02472235  0.19735505  0.01518932  0.08643233]
 [ 0.80712401  0.82844154  0.88498924  0.76115704  0.19663295  0.46379685
   0.98160681 -0.94997386  0.97475823  0.79056684  0.94489677  1.00118343
   0.78389866  0.00939688  0.0658085   0.20518392  0.01757781  0.11978365]
 [ 0.58593517  0.92691166  0.70660663  0.55142559  0.14876723  0.39786723
   0.80082111 -0.76722075  0.79763248  0.86747579  0.78600191  0.78389866
   1.00118343  0.21553366  0.16316265 -0.05573322 -0.22471583 -0.11814142]
 [-0.24697246  0.06882659 -0.22962442 -0.39092105 -0.32144977 -0.33584133
   0.01132718  0.07848365  0.02757736  0.05391989  0.02585841  0.00939688
   0.21553366  1.00118343 -0.05782288 -0.12414277 -0.83372383 -0.90239877]
 [ 0.19754181  0.13651201  0.09922417  0.03579728 -0.05609621  0.08199536
   0.06431825 -0.04699819  0.07321311  0.13085669  0.02472235  0.0658085
   0.16316265 -0.05782288  1.00118343 -0.04178316  0.0867631   0.06269293]
 [ 0.1565327  -0.00967793  0.26265581  0.17981316 -0.02111342  0.14183116
   0.21189733 -0.18385891  0.21405404  0.00413356  0.19735505  0.20518392
  -0.05573322 -0.12414277 -0.04178316  1.00118343  0.07456104  0.20088894]
 [ 0.29889034 -0.10455005  0.14627113  0.40632957  0.401356    0.08389276
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01518932  0.01757781
  -0.22471583 -0.83372383  0.0867631   0.07456104  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.49234013  0.41622574  0.41366325
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08643233  0.11978365
  -0.11814142 -0.90239877  0.06269293  0.20088894  0.89363767  1.00118343]]
In [255]:
#Finding eigenvalues amd eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s', eigenvectors)
print('\n Eigen Values \n%s', eigenvalues)
Eigen Vectors 
%s [[-2.72502890e-01 -8.70435783e-02  3.81852075e-02  1.38675013e-01
  -1.37101466e-01  2.63611383e-01  2.02717114e-01 -7.58796410e-01
   3.66685918e-01  1.60045219e-01  8.40252779e-02  2.14645175e-02
  -1.87350749e-02  6.89082276e-02  4.26105276e-02  9.97784975e-02
  -8.22590084e-02 -3.30366937e-02]
 [-2.87254690e-01  1.31621757e-01  2.01146908e-01 -3.80554832e-02
   1.38995553e-01 -7.13474241e-02 -3.92275358e-01 -6.76034223e-02
   5.53261885e-02 -1.82323962e-01 -3.65229874e-02  1.47247511e-01
  -4.89102355e-02  5.90534770e-02 -6.74107885e-01  1.63466948e-01
  -2.59100771e-01  2.48832011e-01]
 [-3.02421105e-01 -4.61430061e-02 -6.34621085e-02  1.08954287e-01
   8.00174278e-02 -1.69006151e-02  1.63371282e-01  2.77371950e-01
   7.46784853e-02  2.73033778e-01  4.68505530e-01  6.52730855e-01
   4.74162132e-03 -1.62108150e-01 -4.99754439e-04 -6.36582307e-02
   1.20629778e-01  9.80561531e-02]
 [-2.69713545e-01 -1.97931263e-01 -5.62851689e-02 -2.54355087e-01
  -1.33744367e-01 -1.38183653e-01  1.61910525e-01  1.10544748e-01
   2.66666666e-01 -5.05987218e-02 -5.45526034e-01  7.52188680e-02
   3.70499547e-03 -3.93288246e-01  1.74861248e-01 -1.33284415e-01
  -1.86241567e-01  3.60765151e-01]
 [-9.78607336e-02 -2.57839952e-01  6.19927464e-02 -6.12765722e-01
  -1.23601456e-01 -5.77828612e-01  9.27633094e-02 -1.86858758e-01
  -3.86296562e-02 -3.43037888e-02  2.65023238e-01 -2.40287269e-02
   8.90928349e-03  1.63771153e-01 -6.31976228e-02  2.14665592e-02
   1.24639367e-01 -1.77647590e-01]
 [-1.95200137e-01 -1.08045626e-01  1.48957820e-01  2.78678159e-01
   6.34893356e-01 -2.89096995e-01  3.98266293e-01 -4.62187969e-02
  -1.37163365e-01  1.77960797e-01 -1.92846020e-01 -2.29741488e-01
   4.09727876e-03  1.36576102e-01 -9.62482815e-02 -6.89934316e-02
   1.40804371e-01  9.99006987e-02]
 [-3.10523932e-01  7.52853487e-02 -1.09042833e-01  5.39294828e-03
  -8.55574543e-02  9.77471088e-02  9.23519412e-02  6.46204209e-02
  -1.31567659e-01 -1.43132644e-01  9.67172431e-02 -1.53118496e-01
   8.55513044e-01  6.48917601e-02 -4.36596954e-02 -1.56585696e-01
  -1.43109720e-01 -5.28457504e-02]
 [ 3.09006904e-01 -1.32299375e-02  9.08526930e-02  6.52148575e-02
   7.90734442e-02 -7.57282937e-02 -1.04070600e-01 -1.92342823e-01
   2.89633509e-01 -7.93831124e-02 -2.29926427e-02  2.33454000e-02
   2.61858734e-01 -4.96273257e-01 -3.08568675e-01 -2.44030327e-01
   5.11966770e-01 -9.49906147e-02]
 [-3.07287000e-01  8.75601978e-02 -1.06070496e-01  3.08991500e-02
  -8.16463820e-02  1.05403228e-01  9.31317767e-02  1.38684573e-02
  -8.95291026e-02 -2.39896699e-01  1.59356923e-01 -2.17636238e-01
  -4.22479708e-01 -1.13664100e-01 -1.63739102e-01 -6.71547392e-01
  -6.75916711e-02 -2.16727165e-01]
 [-2.78154157e-01  1.22154240e-01  2.13684693e-01  4.14674720e-02
   2.51112937e-01 -7.81962142e-02 -3.54564344e-01 -2.15163418e-01
  -1.58231983e-01 -3.82739482e-01 -1.42837015e-01  3.15261003e-01
   2.00493082e-02 -8.66067604e-03  5.08763287e-01 -5.00643538e-02
   1.60926059e-01 -2.00262071e-01]
 [-2.99765086e-01  7.72657535e-02 -1.44599805e-01 -6.40050869e-02
  -1.47471227e-01  1.32912405e-01  6.80546125e-02  1.95678724e-01
   4.27034669e-02  1.66090908e-01 -4.59667614e-01  1.18383161e-01
  -4.15194745e-02  1.35985919e-01 -2.52182911e-01  2.17416166e-01
   3.24139804e-01 -5.53139002e-01]
 [-3.05532374e-01  7.15030171e-02 -1.10343735e-01 -2.19687048e-03
  -1.10100984e-01  1.15398218e-01  9.01194270e-02  3.77948210e-02
  -1.51072666e-01 -2.87457686e-01  2.09345615e-01 -3.31340876e-01
  -1.22365190e-01 -2.42922436e-01  3.94502237e-02  4.48936624e-01
   4.62827872e-01  3.22499534e-01]
 [-2.63237620e-01  2.10582046e-01  2.02870191e-01 -8.55396458e-02
   5.21210685e-03 -6.70573978e-02 -4.55292717e-01  1.46752664e-01
   2.63771332e-01  5.49626527e-01  1.07713508e-01 -3.99260390e-01
   1.66056546e-02 -3.30876118e-02  2.03029913e-01 -1.06621517e-01
   8.55669069e-02  2.40609291e-02]
 [ 4.19359352e-02  5.03621577e-01 -7.38640211e-02 -1.15399624e-01
  -1.38068605e-01 -1.31513077e-01  8.58226790e-02 -3.30394999e-01
  -5.55267166e-01  3.62547303e-01 -1.26596148e-01  1.21942784e-01
   1.27186667e-03 -2.96030848e-01 -5.79407509e-02 -3.08034829e-02
  -5.10909842e-02  8.79644677e-02]
 [-3.60832115e-02 -1.57663214e-02  5.59173987e-01  4.73703309e-01
  -5.66552244e-01 -3.19176094e-01  1.24532179e-01  1.14255395e-01
  -5.99039250e-02 -5.79891873e-02 -3.25785780e-02  2.88590518e-03
  -4.24341185e-04  4.01635562e-03 -8.22261600e-03  2.05544442e-02
  -4.39201991e-03 -3.76172016e-02]
 [-5.87204797e-02 -9.27462386e-02 -6.70680496e-01  4.28426032e-01
  -1.30869817e-01 -4.68404967e-01 -3.02517700e-01 -1.15403870e-01
   5.23845772e-02  1.28995278e-02 -3.62255133e-02 -1.62495314e-02
  -9.40554994e-03  8.00562035e-02  1.12172401e-02 -2.31296836e-03
   1.13702813e-02  4.44850199e-02]
 [-3.80131449e-02 -5.01621218e-01  6.22407145e-02 -2.74095968e-02
  -1.80519293e-01  2.80136438e-01 -2.58250261e-01 -9.46599623e-02
  -3.79168935e-01  1.87848521e-01 -1.38657118e-01  8.24506703e-02
   2.60800892e-02  2.45816461e-01 -7.88567114e-02 -2.81093089e-01
   3.19960307e-01  3.19055407e-01]
 [-8.47399995e-02 -5.07612106e-01  4.17053530e-02  9.60374943e-02
   1.10788067e-01  5.94444089e-02 -1.73269228e-01 -6.49718344e-03
  -2.80340510e-01  1.33402674e-01  8.39926899e-02 -1.29951586e-01
  -4.18109835e-03 -5.18420304e-01 -3.18514877e-02  2.41164948e-01
  -3.10989286e-01 -3.65128378e-01]]

 Eigen Values 
%s [9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00
 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01
 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02
 5.16287320e-03 4.01448646e-02 1.98136761e-02 2.27005257e-02
 3.22758478e-02 2.93936408e-02]
In [256]:
# Make a set of (eigenvalue, eigenvector) pairs
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(reverse=True)
eigen_pairs[:]
Out[256]:
[(9.749402689379597,
  array([-0.27250289, -0.28725469, -0.30242111, -0.26971354, -0.09786073,
         -0.19520014, -0.31052393,  0.3090069 , -0.307287  , -0.27815416,
         -0.29976509, -0.30553237, -0.26323762,  0.04193594, -0.03608321,
         -0.05872048, -0.03801314, -0.08474   ])),
 (3.3507191194129806,
  array([-0.08704358,  0.13162176, -0.04614301, -0.19793126, -0.25783995,
         -0.10804563,  0.07528535, -0.01322994,  0.0875602 ,  0.12215424,
          0.07726575,  0.07150302,  0.21058205,  0.50362158, -0.01576632,
         -0.09274624, -0.50162122, -0.50761211])),
 (1.1923815452731596,
  array([ 0.03818521,  0.20114691, -0.06346211, -0.05628517,  0.06199275,
          0.14895782, -0.10904283,  0.09085269, -0.1060705 ,  0.21368469,
         -0.1445998 , -0.11034374,  0.20287019, -0.07386402,  0.55917399,
         -0.6706805 ,  0.06224071,  0.04170535])),
 (1.1338191632147836,
  array([ 0.13867501, -0.03805548,  0.10895429, -0.25435509, -0.61276572,
          0.27867816,  0.00539295,  0.06521486,  0.03089915,  0.04146747,
         -0.06400509, -0.00219687, -0.08553965, -0.11539962,  0.47370331,
          0.42842603, -0.0274096 ,  0.09603749])),
 (0.883997312003609,
  array([-0.13710147,  0.13899555,  0.08001743, -0.13374437, -0.12360146,
          0.63489336, -0.08555745,  0.07907344, -0.08164638,  0.25111294,
         -0.14747123, -0.11010098,  0.00521211, -0.1380686 , -0.56655224,
         -0.13086982, -0.18051929,  0.11078807])),
 (0.6662657454310781,
  array([ 0.26361138, -0.07134742, -0.01690062, -0.13818365, -0.57782861,
         -0.289097  ,  0.09774711, -0.07572829,  0.10540323, -0.07819621,
          0.1329124 ,  0.11539822, -0.0670574 , -0.13151308, -0.31917609,
         -0.46840497,  0.28013644,  0.05944441])),
 (0.31815090958438447,
  array([ 0.20271711, -0.39227536,  0.16337128,  0.16191053,  0.09276331,
          0.39826629,  0.09235194, -0.1040706 ,  0.09313178, -0.35456434,
          0.06805461,  0.09011943, -0.45529272,  0.08582268,  0.12453218,
         -0.3025177 , -0.25825026, -0.17326923])),
 (0.22817914211554072,
  array([-0.75879641, -0.06760342,  0.27737195,  0.11054475, -0.18685876,
         -0.0462188 ,  0.06462042, -0.19234282,  0.01386846, -0.21516342,
          0.19567872,  0.03779482,  0.14675266, -0.330395  ,  0.1142554 ,
         -0.11540387, -0.09465996, -0.00649718])),
 (0.13101859512585465,
  array([ 0.36668592,  0.05532619,  0.07467849,  0.26666667, -0.03862966,
         -0.13716337, -0.13156766,  0.28963351, -0.0895291 , -0.15823198,
          0.04270347, -0.15107267,  0.26377133, -0.55526717, -0.05990393,
          0.05238458, -0.37916894, -0.28034051])),
 (0.07986191082036483,
  array([ 0.16004522, -0.18232396,  0.27303378, -0.05059872, -0.03430379,
          0.1779608 , -0.14313264, -0.07938311, -0.2398967 , -0.38273948,
          0.16609091, -0.28745769,  0.54962653,  0.3625473 , -0.05798919,
          0.01289953,  0.18784852,  0.13340267])),
 (0.07339794782509117,
  array([ 0.08402528, -0.03652299,  0.46850553, -0.54552603,  0.26502324,
         -0.19284602,  0.09671724, -0.02299264,  0.15935692, -0.14283702,
         -0.45966761,  0.20934562,  0.10771351, -0.12659615, -0.03257858,
         -0.03622551, -0.13865712,  0.08399269])),
 (0.06461626687535524,
  array([ 0.02146452,  0.14724751,  0.65273085,  0.07521887, -0.02402873,
         -0.22974149, -0.1531185 ,  0.0233454 , -0.21763624,  0.315261  ,
          0.11838316, -0.33134088, -0.39926039,  0.12194278,  0.00288591,
         -0.01624953,  0.08245067, -0.12995159])),
 (0.04014486457709927,
  array([ 0.06890823,  0.05905348, -0.16210815, -0.39328825,  0.16377115,
          0.1365761 ,  0.06489176, -0.49627326, -0.1136641 , -0.00866068,
          0.13598592, -0.24292244, -0.03308761, -0.29603085,  0.00401636,
          0.0800562 ,  0.24581646, -0.5184203 ])),
 (0.032275847766898146,
  array([-0.08225901, -0.25910077,  0.12062978, -0.18624157,  0.12463937,
          0.14080437, -0.14310972,  0.51196677, -0.06759167,  0.16092606,
          0.3241398 ,  0.46282787,  0.08556691, -0.05109098, -0.00439202,
          0.01137028,  0.31996031, -0.31098929])),
 (0.029393640750312124,
  array([-0.03303669,  0.24883201,  0.09805615,  0.36076515, -0.17764759,
          0.0999007 , -0.05284575, -0.09499061, -0.21672717, -0.20026207,
         -0.553139  ,  0.32249953,  0.02406093,  0.08796447, -0.0376172 ,
          0.04448502,  0.31905541, -0.36512838])),
 (0.02270052570621986,
  array([ 0.0997785 ,  0.16346695, -0.06365823, -0.13328441,  0.02146656,
         -0.06899343, -0.1565857 , -0.24403033, -0.67154739, -0.05006435,
          0.21741617,  0.44893662, -0.10662152, -0.03080348,  0.02055444,
         -0.00231297, -0.28109309,  0.24116495])),
 (0.019813676080863783,
  array([ 4.26105276e-02, -6.74107885e-01, -4.99754439e-04,  1.74861248e-01,
         -6.31976228e-02, -9.62482815e-02, -4.36596954e-02, -3.08568675e-01,
         -1.63739102e-01,  5.08763287e-01, -2.52182911e-01,  3.94502237e-02,
          2.03029913e-01, -5.79407509e-02, -8.22261600e-03,  1.12172401e-02,
         -7.88567114e-02, -3.18514877e-02])),
 (0.005162873204745595,
  array([-1.87350749e-02, -4.89102355e-02,  4.74162132e-03,  3.70499547e-03,
          8.90928349e-03,  4.09727876e-03,  8.55513044e-01,  2.61858734e-01,
         -4.22479708e-01,  2.00493082e-02, -4.15194745e-02, -1.22365190e-01,
          1.66056546e-02,  1.27186667e-03, -4.24341185e-04, -9.40554994e-03,
          2.60800892e-02, -4.18109835e-03]))]
In [257]:
# print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigenvalues)
Eigenvalues in descending order: 
[9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00
 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01
 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02
 5.16287320e-03 4.01448646e-02 1.98136761e-02 2.27005257e-02
 3.22758478e-02 2.93936408e-02]
In [258]:
tot = sum(eigenvalues)
var_exp = [( i /tot ) * 100 for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.plot(var_exp)
Cumulative Variance Explained [ 54.0993254   72.69242795  79.30893968  85.60048941  90.50578051
  94.2028816   95.96829741  97.23446089  97.96148159  98.40463444
  98.81191882  99.17047375  99.39323715  99.57233547  99.73544045
  99.86140541  99.97135127 100.        ]
Out[258]:
[<matplotlib.lines.Line2D at 0x1b650c79c48>]
In [259]:
# Ploting 
plt.figure(figsize=(8 , 7))
plt.bar(range(1, eigenvalues.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eigenvalues.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
In [260]:
# Reducing from 17 to 10 dimension space
pca = PCA(n_components=10)
data_reduced = pca.fit_transform(X_z)
data_reduced.transpose()
Out[260]:
array([[ 0.58422804, -1.5121798 ,  3.91344816, ...,  5.12009307,
        -3.29709502, -4.96759448],
       [-0.67567325, -0.34893367,  0.2345073 , ..., -0.18227007,
        -1.10194286,  0.42274968],
       [-0.45333356, -0.33343619, -1.26509352, ..., -0.50836783,
         1.93384417,  1.30871531],
       ...,
       [-0.68196902,  0.10442512,  0.17305277, ..., -0.38820845,
         0.45880709, -0.21433678],
       [ 0.31266966, -0.29625823,  0.19108534, ..., -0.07735512,
         0.82142229,  0.59676772],
       [ 0.14411602, -0.39097765, -0.52948668, ...,  0.55527162,
        -0.34059305,  0.10856429]])
In [261]:
pca.components_
Out[261]:
array([[ 0.27250289,  0.28725469,  0.30242111,  0.26971354,  0.09786073,
         0.19520014,  0.31052393, -0.3090069 ,  0.307287  ,  0.27815416,
         0.29976509,  0.30553237,  0.26323762, -0.04193594,  0.03608321,
         0.05872048,  0.03801314,  0.08474   ],
       [-0.08704358,  0.13162176, -0.04614301, -0.19793126, -0.25783995,
        -0.10804563,  0.07528535, -0.01322994,  0.0875602 ,  0.12215424,
         0.07726575,  0.07150302,  0.21058205,  0.50362158, -0.01576632,
        -0.09274624, -0.50162122, -0.50761211],
       [-0.03818521, -0.20114691,  0.06346211,  0.05628517, -0.06199275,
        -0.14895782,  0.10904283, -0.09085269,  0.1060705 , -0.21368469,
         0.1445998 ,  0.11034374, -0.20287019,  0.07386402, -0.55917399,
         0.6706805 , -0.06224071, -0.04170535],
       [ 0.13867501, -0.03805548,  0.10895429, -0.25435509, -0.61276572,
         0.27867816,  0.00539295,  0.06521486,  0.03089915,  0.04146747,
        -0.06400509, -0.00219687, -0.08553965, -0.11539962,  0.47370331,
         0.42842603, -0.0274096 ,  0.09603749],
       [ 0.13710147, -0.13899555, -0.08001743,  0.13374437,  0.12360146,
        -0.63489336,  0.08555745, -0.07907344,  0.08164638, -0.25111294,
         0.14747123,  0.11010098, -0.00521211,  0.1380686 ,  0.56655224,
         0.13086982,  0.18051929, -0.11078807],
       [ 0.26361138, -0.07134742, -0.01690062, -0.13818365, -0.57782861,
        -0.289097  ,  0.09774711, -0.07572829,  0.10540323, -0.07819621,
         0.1329124 ,  0.11539822, -0.0670574 , -0.13151308, -0.31917609,
        -0.46840497,  0.28013644,  0.05944441],
       [ 0.20271711, -0.39227536,  0.16337128,  0.16191053,  0.09276331,
         0.39826629,  0.09235194, -0.1040706 ,  0.09313178, -0.35456434,
         0.06805461,  0.09011943, -0.45529272,  0.08582268,  0.12453218,
        -0.3025177 , -0.25825026, -0.17326923],
       [-0.75879641, -0.06760342,  0.27737195,  0.11054475, -0.18685876,
        -0.0462188 ,  0.06462042, -0.19234282,  0.01386846, -0.21516342,
         0.19567872,  0.03779482,  0.14675266, -0.330395  ,  0.1142554 ,
        -0.11540387, -0.09465996, -0.00649718],
       [ 0.36668592,  0.05532619,  0.07467849,  0.26666667, -0.03862966,
        -0.13716337, -0.13156766,  0.28963351, -0.0895291 , -0.15823198,
         0.04270347, -0.15107267,  0.26377133, -0.55526717, -0.05990393,
         0.05238458, -0.37916894, -0.28034051],
       [-0.16004522,  0.18232396, -0.27303378,  0.05059872,  0.03430379,
        -0.1779608 ,  0.14313264,  0.07938311,  0.2398967 ,  0.38273948,
        -0.16609091,  0.28745769, -0.54962653, -0.3625473 ,  0.05798919,
        -0.01289953, -0.18784852, -0.13340267]])
In [262]:
X_comp = pd.DataFrame(pca.components_,columns=list(X_z))
X_comp.head()
Out[262]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.272503 0.287255 0.302421 0.269714 0.097861 0.195200 0.310524 -0.309007 0.307287 0.278154 0.299765 0.305532 0.263238 -0.041936 0.036083 0.058720 0.038013 0.084740
1 -0.087044 0.131622 -0.046143 -0.197931 -0.257840 -0.108046 0.075285 -0.013230 0.087560 0.122154 0.077266 0.071503 0.210582 0.503622 -0.015766 -0.092746 -0.501621 -0.507612
2 -0.038185 -0.201147 0.063462 0.056285 -0.061993 -0.148958 0.109043 -0.090853 0.106070 -0.213685 0.144600 0.110344 -0.202870 0.073864 -0.559174 0.670680 -0.062241 -0.041705
3 0.138675 -0.038055 0.108954 -0.254355 -0.612766 0.278678 0.005393 0.065215 0.030899 0.041467 -0.064005 -0.002197 -0.085540 -0.115400 0.473703 0.428426 -0.027410 0.096037
4 0.137101 -0.138996 -0.080017 0.133744 0.123601 -0.634893 0.085557 -0.079073 0.081646 -0.251113 0.147471 0.110101 -0.005212 0.138069 0.566552 0.130870 0.180519 -0.110788
In [263]:
# P_reduce represents reduced mathematical space.
# Reducing from 17 to 10 dimension space
P_reduce = np.array(eigenvectors[0:10])   
# projecting original data into principal component dimensions
X_std_10D = np.dot(X_z,P_reduce.T)   
# converting array to dataframe for pairplot
Proj_data_df = pd.DataFrame(X_std_10D)  
In [264]:
#Let us check it visually
sns.pairplot(Proj_data_df, diag_kind='kde');
  • Now, there are almost no correlation between independent attributes but there are some attributes which shows some correlation.
  • The reason behind this is that some attributes in data are less correlated but we still taken that for dimentionality reduction.
  • The solution may be we can remove columns which are less correlated then apply PCA.
In [265]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(Proj_data_df,y, test_size = 0.3, random_state = 10)

Logistic Regression

In [266]:
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
In [267]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ',model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.8614864864864865
Accuracy on Testing data:  0.8543307086614174
Recall value:  0.863621175327829
Precision value:  0.8442223075975152
Confusion Matrix:
 [[ 60   9   2]
 [  9 104   3]
 [  2  12  53]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.85      0.85        71
           1       0.83      0.90      0.86       116
           2       0.91      0.79      0.85        67

    accuracy                           0.85       254
   macro avg       0.86      0.84      0.85       254
weighted avg       0.86      0.85      0.85       254

  • The accuracy, precision and recall value is reduced after applying PCA because we have reduced the dimensions.
  • Despite giving less accuracy, precision and recall, this model is better as it has taken into consideration the relationship between the independent varibales and reduced the columns which are highly correlated.
  • This model is performing well compared to Naive Bayes'and Naive Bayes' k fold models.
In [268]:
resultsDf=pd.DataFrame({'Model':['Logistic'],'Accuracy': model.score(X_test , y_test)},index={'1'})
resultsDf=resultsDf[['Model','Accuracy']]
resultsDf
Out[268]:
Model Accuracy
1 Logistic 0.854331

Naive Bayes' Classifier

In [269]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
In [270]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.6469594594594594
Accuracy on Testing data:  0.6417322834645669
Recall value:  0.6618565646754088
Precision value:  0.7433265993265993
Confusion Matrix:
 [[30  0  0]
 [16 79  4]
 [25 46 54]]
Classification Report:
               precision    recall  f1-score   support

           0       0.42      1.00      0.59        30
           1       0.63      0.80      0.71        99
           2       0.93      0.43      0.59       125

    accuracy                           0.64       254
   macro avg       0.66      0.74      0.63       254
weighted avg       0.75      0.64      0.64       254

  • This model is not performing well as compared to other models.
In [271]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['Naive Bayes'], 'Accuracy': model.score(X_test, y_test)},index={'2'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[271]:
Model Accuracy
1 Logistic 0.854331
2 Naive Bayes 0.641732

Using k fold cross validation in Naive Bayes

In [272]:
#Use the Naive Bayes CLassifier with k fold cross validation
scores = cross_val_score(model, Proj_data_df, y, cv=10)
print(scores)
print('Average score: ', np.mean(scores))
[0.63529412 0.55294118 0.6        0.64705882 0.55294118 0.75294118
 0.63095238 0.60714286 0.6547619  0.64285714]
Average score:  0.6276890756302522
In [273]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['Naive Bayes k fold'], 'Accuracy': np.mean(scores)},index={'3'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[273]:
Model Accuracy
1 Logistic 0.854331
2 Naive Bayes 0.641732
3 Naive Bayes k fold 0.627689
  • The model is not performing well as compared to other models.

Support Vector Classifier

In [274]:
clf = SVC()
clf.fit(X_train, y_train)
prediction = model.predict(X_test)
In [275]:
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value 
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value 
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(prediction,y_test))
print("Classification Report:\n",metrics.classification_report(prediction,y_test))
Accuracy on Training data:  0.6469594594594594
Accuracy on Testing data:  0.6417322834645669
Recall value:  0.6618565646754088
Precision value:  0.7433265993265993
Confusion Matrix:
 [[30  0  0]
 [16 79  4]
 [25 46 54]]
Classification Report:
               precision    recall  f1-score   support

           0       0.42      1.00      0.59        30
           1       0.63      0.80      0.71        99
           2       0.93      0.43      0.59       125

    accuracy                           0.64       254
   macro avg       0.66      0.74      0.63       254
weighted avg       0.75      0.64      0.64       254

  • This model is not performing well.
In [276]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)},index={'4'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[276]:
Model Accuracy
1 Logistic 0.854331
2 Naive Bayes 0.641732
3 Naive Bayes k fold 0.627689
4 SVM 0.641732

Using Grid Search to tune model parameters

In [292]:
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5] END .............C=0.01, kernel=linear;, score=0.941 total time=   0.2s
[CV 2/5] END .............C=0.01, kernel=linear;, score=0.874 total time=   0.0s
[CV 3/5] END .............C=0.01, kernel=linear;, score=0.907 total time=   0.0s
[CV 4/5] END .............C=0.01, kernel=linear;, score=0.983 total time=   0.0s
[CV 5/5] END .............C=0.01, kernel=linear;, score=0.932 total time=   0.0s
[CV 1/5] END ................C=0.01, kernel=rbf;, score=0.513 total time=   0.0s
[CV 2/5] END ................C=0.01, kernel=rbf;, score=0.513 total time=   0.0s
[CV 3/5] END ................C=0.01, kernel=rbf;, score=0.517 total time=   0.0s
[CV 4/5] END ................C=0.01, kernel=rbf;, score=0.517 total time=   0.0s
[CV 5/5] END ................C=0.01, kernel=rbf;, score=0.508 total time=   0.0s
[CV 1/5] END ..............C=0.1, kernel=linear;, score=0.950 total time=   0.0s
[CV 2/5] END ..............C=0.1, kernel=linear;, score=0.891 total time=   0.0s
[CV 3/5] END ..............C=0.1, kernel=linear;, score=0.949 total time=   0.0s
[CV 4/5] END ..............C=0.1, kernel=linear;, score=0.966 total time=   0.0s
[CV 5/5] END ..............C=0.1, kernel=linear;, score=0.932 total time=   0.0s
[CV 1/5] END .................C=0.1, kernel=rbf;, score=0.538 total time=   0.0s
[CV 2/5] END .................C=0.1, kernel=rbf;, score=0.504 total time=   0.0s
[CV 3/5] END .................C=0.1, kernel=rbf;, score=0.517 total time=   0.0s
[CV 4/5] END .................C=0.1, kernel=rbf;, score=0.517 total time=   0.0s
[CV 5/5] END .................C=0.1, kernel=rbf;, score=0.517 total time=   0.0s
[CV 1/5] END ..............C=0.5, kernel=linear;, score=0.933 total time=   0.1s
[CV 2/5] END ..............C=0.5, kernel=linear;, score=0.891 total time=   0.0s
[CV 3/5] END ..............C=0.5, kernel=linear;, score=0.949 total time=   0.0s
[CV 4/5] END ..............C=0.5, kernel=linear;, score=0.949 total time=   0.5s
[CV 5/5] END ..............C=0.5, kernel=linear;, score=0.924 total time=   0.1s
[CV 1/5] END .................C=0.5, kernel=rbf;, score=0.580 total time=   0.0s
[CV 2/5] END .................C=0.5, kernel=rbf;, score=0.538 total time=   0.0s
[CV 3/5] END .................C=0.5, kernel=rbf;, score=0.627 total time=   0.0s
[CV 4/5] END .................C=0.5, kernel=rbf;, score=0.644 total time=   0.0s
[CV 5/5] END .................C=0.5, kernel=rbf;, score=0.585 total time=   0.0s
[CV 1/5] END ................C=1, kernel=linear;, score=0.941 total time=   0.4s
[CV 2/5] END ................C=1, kernel=linear;, score=0.882 total time=   0.0s
[CV 3/5] END ................C=1, kernel=linear;, score=0.949 total time=   0.4s
[CV 4/5] END ................C=1, kernel=linear;, score=0.949 total time=   0.2s
[CV 5/5] END ................C=1, kernel=linear;, score=0.932 total time=   0.4s
[CV 1/5] END ...................C=1, kernel=rbf;, score=0.689 total time=   0.0s
[CV 2/5] END ...................C=1, kernel=rbf;, score=0.588 total time=   0.0s
[CV 3/5] END ...................C=1, kernel=rbf;, score=0.636 total time=   0.0s
[CV 4/5] END ...................C=1, kernel=rbf;, score=0.653 total time=   0.0s
[CV 5/5] END ...................C=1, kernel=rbf;, score=0.602 total time=   0.0s
Best Hyper Parameters:
 {'C': 0.1, 'kernel': 'linear'}
In [294]:
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n",model1.best_estimator_)
print("\n The best score across ALL searched params:\n",model1.best_score_)
print("\n The best parameters across ALL searched params:\n",model1.best_params_)
 Results from Grid Search 

 The best estimator across ALL searched params:
 SVC(C=0.1, kernel='linear')

 The best score across ALL searched params:
 0.9375587523144852

 The best parameters across ALL searched params:
 {'C': 0.1, 'kernel': 'linear'}

B. Share best Parameters observed from above step.

  • The best estimator across ALL searched params: SVC(C=0.1, kernel='linear')

  • The best score across ALL searched params: 0.9375587523144852

  • The best parameters across ALL searched params: {'C': 0.1, 'kernel': 'linear'}

In [278]:
#Build the model with the best hyper parameters
model = SVC(C=0.5, kernel="linear")
scores = cross_val_score(model, Proj_data_df, y, cv=10)
print(scores)
print(np.mean(scores))
[0.81176471 0.8        0.85882353 0.85882353 0.87058824 0.84705882
 0.85714286 0.9047619  0.88095238 0.88095238]
0.8570868347338936
In [279]:
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM k fold'], 'Accuracy': np.mean(scores)},index={'5'})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Model','Accuracy']]
resultsDf
Out[279]:
Model Accuracy
1 Logistic 0.854331
2 Naive Bayes 0.641732
3 Naive Bayes k fold 0.627689
4 SVM 0.641732
5 SVM k fold 0.857087
  • We can see that Logistic Regression and SVM k fold is given better results as compared to other models so we can use either of the two models to predict the silhouette as one of three types of vehicle.
In [280]:
#splitting the data in test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 10)
In [281]:
# scaling the data using the standard scaler
from sklearn.preprocessing import StandardScaler
X_train_sd = StandardScaler().fit_transform(X_train)
X_test_sd = StandardScaler().fit_transform(X_test)
In [282]:
# generating the covariance matrix and the eigen values for the PCA analysis
cov_matrix = np.cov(X_train_sd.T) # the relevanat covariance matrix
print('Covariance Matrix \n%s', cov_matrix)

#generating the eigen values and the eigen vectors
e_vals, e_vecs = np.linalg.eig(cov_matrix)
print('Eigenvectors \n%s' %e_vecs)
print('\nEigenvalues \n%s' %e_vals)
Covariance Matrix 
%s [[ 1.00169205  0.69510999  0.79093174  0.72835184  0.21888501  0.51630765
   0.81466568 -0.79249053  0.81562662  0.69493762  0.76965639  0.8079382
   0.6018784  -0.23292845  0.1777701   0.16129088  0.29107259  0.36229106]
 [ 0.69510999  1.00169205  0.80101871  0.66245848  0.23982668  0.55826096
   0.85124238 -0.82911962  0.84353731  0.96223666  0.80730936  0.83188956
   0.92490531  0.06127477  0.12423951  0.0144264  -0.08843129  0.06521593]
 [ 0.79093174  0.80101871  1.00169205  0.7978868   0.26287468  0.6778587
   0.90364003 -0.91115044  0.89344273  0.79364077  0.86732697  0.88272575
   0.72188889 -0.20989583  0.09286116  0.26881961  0.13214005  0.3236881 ]
 [ 0.72835184  0.66245848  0.7978868   1.00169205  0.66543332  0.48358479
   0.77875874 -0.8336149   0.75508573  0.61730125  0.78571466  0.76638865
   0.57420016 -0.37127434  0.02851743  0.18594076  0.39856553  0.4918137 ]
 [ 0.21888501  0.23982668  0.26287468  0.66543332  1.00169205  0.18653387
   0.21959938 -0.32322592  0.19042837  0.19092675  0.21971696  0.21647296
   0.16973751 -0.32819617 -0.06393545 -0.0160431   0.41444946  0.44062526]
 [ 0.51630765  0.55826096  0.6778587   0.48358479  0.18653387  1.00169205
   0.49615616 -0.51094144  0.4935197   0.63472122  0.40676676  0.46554734
   0.39466359 -0.35309648  0.06999148  0.1660827   0.09975973  0.42710342]
 [ 0.81466568  0.85124238  0.90364003  0.77875874  0.21959938  0.49615616
   1.00169205 -0.9725199   0.99059693  0.82270375  0.9584225   0.98422714
   0.81084337  0.0299434   0.05983065  0.21952868 -0.00767062  0.11103332]
 [-0.79249053 -0.82911962 -0.91115044 -0.8336149  -0.32322592 -0.51094144
  -0.9725199   1.00169205 -0.95013386 -0.79326652 -0.94417927 -0.95082111
  -0.77994992  0.06408998 -0.04432899 -0.19656076 -0.10243634 -0.21194754]
 [ 0.81562662  0.84353731  0.89344273  0.75508573  0.19042837  0.4935197
   0.99059693 -0.95013386  1.00169205  0.82210611  0.94516704  0.97701052
   0.80471323  0.04391383  0.06706305  0.2241005  -0.0279134   0.09573866]
 [ 0.69493762  0.96223666  0.79364077  0.61730125  0.19092675  0.63472122
   0.82270375 -0.79326652  0.82210611  1.00169205  0.76756616  0.80402381
   0.87026942  0.03994679  0.11672182  0.02772322 -0.082233    0.09842759]
 [ 0.76965639  0.80730936  0.86732697  0.78571466  0.21971696  0.40676676
   0.9584225  -0.94417927  0.94516704  0.76756616  1.00169205  0.94446271
   0.79797753  0.04745616  0.02237996  0.20973283 -0.00256406  0.07666116]
 [ 0.8079382   0.83188956  0.88272575  0.76638865  0.21647296  0.46554734
   0.98422714 -0.95082111  0.97701052  0.80402381  0.94446271  1.00169205
   0.79721866  0.0340486   0.06429078  0.21036607  0.00118648  0.10727801]
 [ 0.6018784   0.92490531  0.72188889  0.57420016  0.16973751  0.39466359
   0.81084337 -0.77994992  0.80471323  0.87026942  0.79797753  0.79721866
   1.00169205  0.21285884  0.15273557 -0.03728709 -0.21250932 -0.10244546]
 [-0.23292845  0.06127477 -0.20989583 -0.37127434 -0.32819617 -0.35309648
   0.0299434   0.06408998  0.04391383  0.03994679  0.04745616  0.0340486
   0.21285884  1.00169205 -0.04119672 -0.10539696 -0.83962739 -0.90384098]
 [ 0.1777701   0.12423951  0.09286116  0.02851743 -0.06393545  0.06999148
   0.05983065 -0.04432899  0.06706305  0.11672182  0.02237996  0.06429078
   0.15273557 -0.04119672  1.00169205 -0.03069198  0.06833689  0.04382118]
 [ 0.16129088  0.0144264   0.26881961  0.18594076 -0.0160431   0.1660827
   0.21952868 -0.19656076  0.2241005   0.02772322  0.20973283  0.21036607
  -0.03728709 -0.10539696 -0.03069198  1.00169205  0.04730685  0.17310175]
 [ 0.29107259 -0.08843129  0.13214005  0.39856553  0.41444946  0.09975973
  -0.00767062 -0.10243634 -0.0279134  -0.082233   -0.00256406  0.00118648
  -0.21250932 -0.83962739  0.06833689  0.04730685  1.00169205  0.89853589]
 [ 0.36229106  0.06521593  0.3236881   0.4918137   0.44062526  0.42710342
   0.11103332 -0.21194754  0.09573866  0.09842759  0.07666116  0.10727801
  -0.10244546 -0.90384098  0.04382118  0.17310175  0.89853589  1.00169205]]
Eigenvectors 
[[ 2.72080885e-01  8.34775563e-02  1.43861268e-01  1.42296437e-02
  -1.06156993e-01  2.78692394e-01  2.15688191e-01 -7.61377690e-01
   3.65553518e-01  1.34319298e-01  6.96400549e-02  3.59446164e-02
  -1.88521436e-02 -7.78841886e-02  3.92426440e-02  9.26174763e-02
  -8.89209702e-02  2.62627208e-02]
 [ 2.87332176e-01 -1.16568781e-01 -4.61041984e-03  2.00010607e-01
   1.50325443e-01 -8.73306306e-02 -4.04168704e-01 -8.51074659e-02
   3.27074833e-02 -1.86811110e-01 -2.86703548e-02  1.29605519e-01
  -4.51737836e-02 -1.29901469e-01  3.38951331e-01 -4.73214871e-02
   3.67200020e-01  5.81842286e-01]
 [ 3.01097011e-01  3.99245109e-02  9.78114408e-02 -8.69150172e-02
   8.91766063e-02 -1.14219108e-02  1.48081127e-01  2.85723938e-01
   8.92917420e-02  3.60613054e-01  2.94734799e-01  7.18169112e-01
   1.38367405e-02  1.38925669e-01 -3.58781320e-02 -1.28673066e-01
   3.13956225e-02 -2.11288885e-03]
 [ 2.71740645e-01  1.91630651e-01 -2.36186426e-01 -3.26150604e-03
  -1.67008796e-01 -1.42444491e-01  1.48728033e-01  1.16024253e-01
   2.78679004e-01 -1.35726297e-01 -4.85673653e-01 -2.96345608e-03
  -3.86287919e-03  3.94700295e-01  4.38429261e-01 -1.87391321e-01
  -8.72113033e-02 -1.64802759e-01]
 [ 1.06743003e-01  2.63972684e-01 -5.47607605e-01  1.72593979e-01
  -1.84727140e-01 -5.98319199e-01  1.08235819e-01 -1.80011377e-01
  -3.28768182e-02  9.45088120e-03  2.44842390e-01  2.41150726e-02
   1.16861020e-02 -1.48278238e-01 -2.45895162e-01  6.87322599e-02
   4.84455341e-02  4.09200757e-02]
 [ 1.95948139e-01  1.25566903e-01  2.78234491e-01  3.15789170e-02
   6.42589536e-01 -3.11457282e-01  3.92486847e-01 -4.06462868e-02
  -1.12581374e-01  1.42092976e-01 -1.79077297e-01 -2.77160268e-01
  -1.72048465e-04 -1.05094153e-01 -5.31067502e-02 -1.88634455e-01
   1.02799703e-01  5.76863722e-03]
 [ 3.09080827e-01 -8.10431284e-02 -1.62141102e-02 -9.41668448e-02
  -8.66246077e-02  1.04364217e-01  1.03546570e-01  6.75425608e-02
  -1.47827063e-01 -1.41705956e-01  1.45427537e-01 -1.35164569e-01
   8.60109831e-01 -5.25663141e-02  7.47250853e-02  7.37517687e-02
   1.20453130e-01 -7.51987644e-02]
 [-3.08018498e-01  1.75872930e-02  7.52663018e-02  6.62360536e-02
   9.02730254e-02 -7.20423879e-02 -1.05287003e-01 -2.00483555e-01
   2.59058931e-01 -7.93181695e-02 -2.48369274e-02  3.97965538e-02
   2.63577953e-01  5.92073597e-01 -4.08656517e-01 -2.17939136e-01
   3.27467224e-01  1.13067707e-01]
 [ 3.05826258e-01 -9.10230352e-02  7.47259150e-03 -9.92164364e-02
  -7.98214200e-02  1.13528303e-01  1.07341677e-01  1.70217852e-02
  -1.29336150e-01 -2.19977379e-01  2.32268686e-01 -1.57143274e-01
  -3.98831476e-01  2.05119165e-01 -5.06094445e-02  1.49440803e-01
   5.98904600e-01 -3.58791876e-01]
 [ 2.81041003e-01 -1.04150058e-01  6.14846316e-02  1.86476205e-01
   2.56377378e-01 -8.50407290e-02 -3.56245995e-01 -2.08973860e-01
  -1.94842799e-01 -3.68031033e-01 -1.68683398e-01  3.07321628e-01
   1.75081711e-02  4.48535535e-02 -2.48516000e-01  6.73272304e-02
  -3.21372321e-01 -3.99836821e-01]
 [ 2.97813565e-01 -8.66686013e-02 -8.01303955e-02 -1.13857951e-01
  -1.51471943e-01  1.42652179e-01  5.69170349e-02  2.02171148e-01
   9.91826949e-02  1.12097632e-01 -5.44473491e-01  2.80194877e-03
  -2.92675160e-02 -1.29829781e-01 -5.58564129e-01  2.65894760e-01
   9.63645464e-02  2.61630063e-01]
 [ 3.04206101e-01 -8.06955945e-02 -2.11958403e-02 -9.08890299e-02
  -1.15375529e-01  1.27147385e-01  1.02260936e-01  4.58231241e-02
  -1.85692283e-01 -2.30693784e-01  2.73912721e-01 -2.40701754e-01
  -1.63711470e-01  2.38832751e-01 -2.08939302e-01 -4.38353077e-01
  -4.07852184e-01  3.78866755e-01]
 [ 2.64969998e-01 -2.00512459e-01 -3.92494563e-02  2.21647677e-01
   1.92972623e-02 -5.91530334e-02 -4.50419633e-01  1.26648888e-01
   2.92853887e-01  5.20093826e-01  1.48425521e-01 -4.13945273e-01
   1.55429142e-02  7.22529620e-02 -4.59157272e-02 -7.32152662e-02
  -7.97194648e-02 -2.17256489e-01]
 [-3.77750383e-02 -5.06747105e-01 -1.26364120e-01 -3.02325653e-02
  -1.45674808e-01 -1.30353133e-01  9.47071020e-02 -3.18669789e-01
  -5.26958506e-01  4.00202718e-01 -2.18576643e-01  8.58077331e-02
   5.09372000e-03  2.45857280e-01  1.34288897e-01 -3.72911444e-02
   4.07015715e-02  4.03343318e-02]
 [ 3.22518808e-02  6.49212998e-03  6.42390934e-01  4.48926637e-01
  -5.14867288e-01 -2.97197230e-01  1.10548341e-01  1.03635739e-01
  -5.26319602e-02 -5.25007490e-02 -3.00792974e-02  4.89890327e-05
   3.89253223e-04 -7.09356001e-03 -2.05130863e-02  3.46355031e-02
  -1.62823117e-03  1.53579702e-02]
 [ 6.24262161e-02  7.16499411e-02  2.86797037e-01 -7.65038775e-01
  -1.67024520e-01 -4.12249837e-01 -3.16051479e-01 -1.24644641e-01
   4.69819730e-02  6.13957090e-03 -1.44466793e-02 -3.11550038e-02
  -7.34871164e-03 -6.36706171e-02  5.85973118e-03 -4.65230198e-02
  -7.86997718e-03 -1.91699266e-02]
 [ 3.61479464e-02  5.03000142e-01 -1.26704943e-02  7.33524119e-02
  -1.86669253e-01  2.89023524e-01 -2.37884531e-01 -9.13627949e-02
  -3.43333134e-01  1.90707372e-01 -1.73145328e-01  3.14918129e-02
   2.51562534e-02 -1.96907767e-01 -8.18903828e-02 -5.04561404e-01
   2.25315377e-01 -1.41943989e-01]
 [ 8.41770112e-02  5.10355254e-01  8.61463311e-02  1.02197003e-02
   9.90431183e-02  6.53534764e-02 -1.60729448e-01  1.46000574e-03
  -3.02707613e-01  1.80275709e-01  5.49291150e-02 -1.05453106e-01
   3.61375835e-03  4.34855864e-01  8.31976684e-02  5.38377437e-01
  -1.32542015e-01  2.08058715e-01]]

Eigenvalues 
[9.82652956e+00 3.34510354e+00 1.11901533e+00 1.16336788e+00
 8.65058498e-01 6.64811390e-01 3.20576109e-01 2.28294790e-01
 1.21712809e-01 8.12281281e-02 7.22711430e-02 6.62987147e-02
 5.70975207e-03 4.20526870e-02 3.45573328e-02 2.90514705e-02
 2.11481067e-02 2.36696085e-02]
In [283]:
# the "cumulative variance explained" analysis 
tot = sum(e_vals)
var_exp = [( i /tot ) * 100 for i in sorted(e_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 54.49961499  73.05213179  79.50436917  85.71061975  90.50838226
  94.19554002  95.97351     97.23967198  97.91471205  98.36521716
  98.76604529  99.13374931  99.36698071  99.55864159  99.71976602
  99.85104172  99.96833274 100.        ]
In [284]:
# Plotting the variance expalained by the principal components and the cumulative variance explained.
plt.figure(figsize=(10 , 5))
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
In [285]:
eigen_pairs = [(np.abs(e_vals[i]), e_vecs[:,i]) for i in range(len(e_vals))]
eigen_pairs.sort(reverse=True)
eigen_pairs[:5]
Out[285]:
[(9.826529564948988,
  array([ 0.27208088,  0.28733218,  0.30109701,  0.27174064,  0.106743  ,
          0.19594814,  0.30908083, -0.3080185 ,  0.30582626,  0.281041  ,
          0.29781357,  0.3042061 ,  0.26497   , -0.03777504,  0.03225188,
          0.06242622,  0.03614795,  0.08417701])),
 (3.34510353659071,
  array([ 0.08347756, -0.11656878,  0.03992451,  0.19163065,  0.26397268,
          0.1255669 , -0.08104313,  0.01758729, -0.09102304, -0.10415006,
         -0.0866686 , -0.08069559, -0.20051246, -0.50674711,  0.00649213,
          0.07164994,  0.50300014,  0.51035525])),
 (1.1633678770584706,
  array([ 0.01422964,  0.20001061, -0.08691502, -0.00326151,  0.17259398,
          0.03157892, -0.09416684,  0.06623605, -0.09921644,  0.18647621,
         -0.11385795, -0.09088903,  0.22164768, -0.03023257,  0.44892664,
         -0.76503878,  0.07335241,  0.0102197 ])),
 (1.11901533405019,
  array([ 0.14386127, -0.00461042,  0.09781144, -0.23618643, -0.5476076 ,
          0.27823449, -0.01621411,  0.0752663 ,  0.00747259,  0.06148463,
         -0.0801304 , -0.02119584, -0.03924946, -0.12636412,  0.64239093,
          0.28679704, -0.01267049,  0.08614633])),
 (0.8650584983404681,
  array([-0.10615699,  0.15032544,  0.08917661, -0.1670088 , -0.18472714,
          0.64258954, -0.08662461,  0.09027303, -0.07982142,  0.25637738,
         -0.15147194, -0.11537553,  0.01929726, -0.14567481, -0.51486729,
         -0.16702452, -0.18666925,  0.09904312]))]
In [286]:
# generating dimensionally reduced datasets
w = np.hstack((eigen_pairs[0][1].reshape(-1,1), eigen_pairs[1][1].reshape(-1,1)))
print('Matrix W:\n', w)
X_sd_pca = X_train_sd.dot(w)
X_test_sd_pca = X_test_sd.dot(w)
Matrix W:
 [[ 0.27208088  0.08347756]
 [ 0.28733218 -0.11656878]
 [ 0.30109701  0.03992451]
 [ 0.27174064  0.19163065]
 [ 0.106743    0.26397268]
 [ 0.19594814  0.1255669 ]
 [ 0.30908083 -0.08104313]
 [-0.3080185   0.01758729]
 [ 0.30582626 -0.09102304]
 [ 0.281041   -0.10415006]
 [ 0.29781357 -0.0866686 ]
 [ 0.3042061  -0.08069559]
 [ 0.26497    -0.20051246]
 [-0.03777504 -0.50674711]
 [ 0.03225188  0.00649213]
 [ 0.06242622  0.07164994]
 [ 0.03614795  0.50300014]
 [ 0.08417701  0.51035525]]
In [287]:
X_train_sd.shape, w.shape, X_sd_pca.shape, X_test_sd_pca.shape
Out[287]:
((592, 18), (18, 2), (592, 2), (254, 2))
In [288]:
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train_sd, y_train)
print ('Before PCA score', clf.score(X_test_sd, y_test))

clf.fit(X_sd_pca, y_train)
print ('After PCA score', clf.score(X_test_sd_pca, y_test))
Before PCA score 0.9566929133858267
After PCA score 0.6181102362204725
In [289]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_sd, y_train)
print ('Before PCA score', model.score(X_test_sd, y_test))

model.fit(X_sd_pca, y_train)
print ('After PCA score', model.score(X_test_sd_pca, y_test))
Before PCA score 0.9488188976377953
After PCA score 0.5551181102362205
In [290]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train_sd, y_train)
print ('Before PCA score', model.score(X_test_sd, y_test))

model.fit(X_sd_pca, y_train)
print ('After PCA score', model.score(X_test_sd_pca, y_test))
Before PCA score 0.5984251968503937
After PCA score 0.531496062992126

5. Data Understanding & Cleaning:

5.A. Explain pre-requisite/assumptions of PCA.

  • There must be linearity in the data set, i.e. the variables combine in a linear manner to form the dataset. The variables exhibit relationships among themselves.
  • PCA assumes that the principal component with high variance must be paid attention and the PCs with lower variance are disregarded as noise. Pearson correlation coefficient framework led to the origin of PCA, and there it was assumed first that the axes with high variance would only be turned into principal components.

5.B. Explain advantages and limitations of PCA.

Advantages of PCA:

  • Highlight similarities and differences in data
  • Reduce dimensions without much information loss
  • Reduces Overfitting
  • Improves visualization

limitations of PCA:

  • PCA is not robust against outliers: Similar to the point above, the algorithm will be biased in datasets with strong outliers. This is why it is recommended to remove outliers before performing PCA.
  • PCA assumes a correlation between features: If the features (or dimensions or columns, in tabular data) are not correlated, PCA will be unable to determine principal components.